Concatenate with OME-Zarr v0.5 and sharding #104
base: main
Conversation
As of 83cd243, converting a 282 GB dataset (325 GB decompressed) took 7 minutes on 2 nodes with 64 CPUs each. For reference, converting a 65 GB dataset (183 GB decompressed) takes 2 minutes on 16 CPUs when using thread parallelism.
Ran into zarr-developers/zarr-python#3221.
pyproject.toml (Outdated)

    "scikit-learn",
]

[project.optional-dependencies]
Is this any different than the one from PyPI?
Good question - I'm guessing no? @tayllatheodoro may know better
-    sbatch_filepath: str = None,
+    sbatch_filepath: str | None = None,
     local: bool = False,
     block: bool = False,
What was the motivation for including the block parameter? Was it useful during testing?
When running locally there isn't a good way to check whether the jobs (processes) have finished, so `block` lets the caller wait for them. It is also useful for testing.
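For context, here is a minimal sketch of how a `block` flag like this can be handled for local runs. This is a hypothetical illustration only; the `submit_jobs` name and the use of `multiprocessing` are assumptions, not biahub's actual implementation.

```python
# Hypothetical sketch: how a `block` flag could wait on local jobs.
# Names and structure are illustrative, not biahub's implementation.
import multiprocessing


def submit_jobs(
    tasks,
    sbatch_filepath: str | None = None,
    local: bool = False,
    block: bool = False,
):
    if local:
        procs = [multiprocessing.Process(target=task) for task in tasks]
        for p in procs:
            p.start()
        if block:
            # Without blocking, local processes run detached and there is
            # no easy way to tell when they have finished.
            for p in procs:
                p.join()
        return procs
    # On a cluster, jobs would instead be submitted via sbatch
    # (optionally using a template script at `sbatch_filepath`).
    raise NotImplementedError("cluster submission omitted in this sketch")
```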
The current plan is that this PR will be merged after czbiohub-sf/iohub#301, updating the iohub dependency to the main branch.
LGTM, I was able to run `biahub concatenate -c rechunk.yml -o test.zarr -sb sbatch.sh` on a dataset which hasn't been converted from OME-NGFF v0.4/Zarr V2 to OME-NGFF v0.5/Zarr V3, over here: /hpc/projects/intracellular_dashboard/organelle_dynamics/rerun/2025_04_15_A549_H2B_CAAX_ZIKV_DENV/2-assemble/zarr-v3.
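A quick way to sanity-check the result of such a run — a sketch, assuming zarr-python ≥ 3 and that the arrays sit at the top level of `test.zarr` (OME-Zarr layouts often nest them deeper):

```python
# Sketch: confirm the output is Zarr V3 and inspect chunk/shard shapes.
# The store path and flat layout are assumptions; adjust for nested groups.
import zarr

group = zarr.open_group("test.zarr", mode="r")
print(group.metadata.zarr_format)  # expect 3 for OME-Zarr v0.5

for name, arr in group.arrays():
    # `shards` is None when the sharding codec is not in use.
    print(name, arr.chunks, arr.shards)
```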
Blocked until we bump waveorder.
Another blocker is
Needs czbiohub-sf/iohub#311
To be investigated: multiprocessing-based parallelism is not compatible with the asyncio-based thread parallelism that zarr-python is designed around, and it appears to be a bit slower.
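For a rough illustration of the thread-based pattern that zarr-python's async internals favor, here is a sketch. The shapes, chunk/shard sizes, and output path are made up, and it assumes zarr-python ≥ 3, where `zarr.create_array` accepts a `shards=` argument:

```python
# Sketch: writing time points of a sharded Zarr V3 array from a thread pool.
# Shapes, chunking, and the output path are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import zarr

arr = zarr.create_array(
    store="scratch_v3.zarr",
    shape=(64, 2, 32, 512, 512),   # (T, C, Z, Y, X)
    chunks=(1, 1, 32, 256, 256),
    shards=(1, 2, 32, 512, 512),   # one shard per time point
    dtype="uint16",
)


def write_timepoint(t: int) -> None:
    # Thread workers share the open array and zarr's asyncio-based I/O;
    # separate processes would each need to reopen the store and cannot
    # share the event loop.
    arr[t] = np.random.randint(0, 2**16, size=arr.shape[1:], dtype="uint16")


with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(write_timepoint, range(arr.shape[0])))
```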